DNN-Based Speech Synthesis: Importance of Input Features and Training Data

Authors

Alexandros Lazaridis

Blaise Potard

Philip N. Garner

Abstract

Deep neural networks (DNNs) have recently been introduced in speech synthesis. This paper investigates the importance of input features and training data in speaker-dependent (SD) DNN-based speech synthesis. Various aspects of the DNN training procedure are examined, and several training sets of different sizes (13.5, 3.6 and 1.5 h of speech) are evaluated.

Similar resources

An investigation of context clustering for statistical speech synthesis with deep neural network

The state-of-the-art DNN speech synthesis system directly maps linguistic input to acoustic output, and voice quality improvement over the conventional MSD-GMM-HMM synthesis system has been reported. A DNN-based speech synthesis system does not require context clustering as GMM-HMM systems do, and this was believed to be the main advantage and contributor to performance improvement. Our previous wo...

A comparison of speech synthesis systems based on GPR, HMM, and DNN with a small amount of training data

In this paper, we evaluate a framework of statistical parametric speech synthesis based on Gaussian process regression (GPR) and compare it with those based on hidden Markov model (HMM) and deep neural network (DNN). Recently, for the purpose of improving the performance of HMM-based speech synthesis, novel frameworks using deep architectures have been proposed and have shown their effectivenes...

An Investigation of DNN-Based Speech Synthesis Using Speaker Codes

Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than the conventional HMM-based speech synthesis. However, an open problem remains as to whether the synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a simple met...

Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features

Speech enhancement is an important front-end technique to improve automatic speech recognition (ASR) in noisy environments. However, incorrect noise suppression during speech enhancement often causes additional distortions in speech signals, which degrades ASR performance. To compensate for the distortions, ASR needs to consider the uncertainty of enhanced features, which can be achieved by using t...

Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis

Feed-forward deep neural networks (DNNs) based text-to-speech (TTS) synthesis, which employs a multi-layered structure to exploit the statistical correlations between rich contextual information and the corresponding acoustic features, has been shown to outperform a decision tree based, GMM-HMM counterpart. However, the DNN-based TTS training has not taken the whole sequence, i.e., sentence, int...

Publication date: 2015
